(54) Method and system for speech recognition

(57) A method and system for improving speech recognition through front-end normalization of feature vectors are provided. Speech to be recognized is spoken into a microphone, amplified by an amplifier, and converted from an analog signal to a digital signal by an analog-to-digital ("A/D") converter. The digital signal from the A/D converter is input to a feature extractor that breaks down the signal into frames of speech and then extracts a feature vector from each of the frames. The feature vector is input to an input normalizer that normalizes the vector. The input normalizer normalizes the feature vector by computing a correction vector and subtracting the correction vector from the feature vector. The correction vector is computed based on the probability of the current frame of speech being noise and based on the average noise and speech feature vectors for a current utterance and a database of utterances. The normalization of the feature vector reduces the effect of changes in the acoustical environment on the feature vector. The normalized feature vector is input to a pattern matcher that compares the normalized vector to feature models stored in the database to find an exact match or a best match.
Description

Field of the invention 

This invention relates generally to speech recognition and, more particularly, to a method and system for improving speech recognition through front-end normalization of feature vectors.

Background of the Invention 

A variety of speech recognition systems have been developed. These systems enable computers to understand speech. This ability is useful for inputting commands or data into computers. Speech recognition generally involves two phases. The first phase is known as training. During training, the system "learns" speech by inputting a large sample of speech and generating models of the speech. The second phase is known as recognition. During recognition, the system attempts to recognize input speech by comparing the speech to the models generated during training and finding an exact match or a best match. Most speech recognition systems have a front-end that extracts some features from the input speech in the form of feature vectors. These feature vectors are used to generate the models during training and are compared to the generated models during recognition.

One problem with such speech recognition systems arises when there are changes in the acoustical environment during and between training and recognition. Such changes could result, for example, from changes in the microphone used, the background noise, the distance between the speaker's mouth and the microphone, and the room acoustics. If changes occur, the system may not work very well because the acoustical environment affects the feature vectors extracted from speech. Thus, different feature vectors may be extracted from the same speech if spoken in different acoustical environments. Since the acoustical environment will rarely remain constant, it is desirable for a speech recognition system to be robust to changes in the acoustical environment. A particular word or sentence should always be recognized as that word or sentence, regardless of the acoustical environment in which the word or sentence is spoken. Some attempts to solve the problem of changes in the acoustical environment have focused on normalizing the input speech feature vectors to reduce the effect of such changes.

One attempt to solve this problem is known as mean normalization. Using mean normalization, the input speech feature vector is normalized by computing the mean of all the feature vectors extracted from the entire speech and subtracting the mean from the input speech feature vector using the function:

$$\hat{x}(t) = x(t) - \frac{1}{n}\sum_{i=1}^{n} x(i)$$

where $\hat{x}(t)$ is the normalized input speech feature vector, $x(t)$ is the raw input speech feature vector, and $n$ is the number of feature vectors extracted from the entire speech.
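As a rough illustration of mean normalization, the following Python sketch subtracts the per-utterance mean from every feature vector. The function name `mean_normalize` and the toy two-dimensional vectors are assumptions made for this example only.

```python
def mean_normalize(frames):
    """Subtract the mean feature vector of the whole utterance from each frame."""
    n = len(frames)
    dim = len(frames[0])
    # One mean per vector component, computed over all n frames.
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]


# Toy utterance with two frames of two features each (hypothetical values).
frames = [[1.0, 4.0], [3.0, 8.0]]
normalized = mean_normalize(frames)
```

Because the mean is taken over the entire speech, every frame receives the same correction, which is the limitation discussed later in this section.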

Another attempt to solve this problem is known as signal-to-noise-ratio-dependent ("SNR-dependent") normalization. Using SNR-dependent normalization, the input speech feature vector is normalized by computing the instantaneous SNR of the input speech and subtracting a correction vector that depends on the SNR from the input speech feature vector using the function:

$$\hat{x}(t) = x(t) - y(\mathrm{SNR})$$

where $\hat{x}(t)$ is the normalized input speech feature vector, $x(t)$ is the raw input speech feature vector, and $y(\mathrm{SNR})$ is the correction vector. The correction vectors are precomputed and stored in a look-up table with the corresponding SNRs.
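The look-up step of SNR-dependent normalization might be sketched as follows. The SNR breakpoints, the correction values, and the names `SNR_BREAKPOINTS_DB`, `CORRECTIONS`, and `snr_normalize` are all hypothetical; the text only states that precomputed correction vectors are stored with their corresponding SNRs.

```python
import bisect

# Hypothetical table: SNR bin edges in dB and one correction vector per bin.
SNR_BREAKPOINTS_DB = [0.0, 10.0, 20.0]
CORRECTIONS = [
    [2.0, 0.5],     # SNR below 0 dB
    [1.0, 0.25],    # 0 dB up to 10 dB
    [0.5, 0.125],   # 10 dB up to 20 dB
    [0.0, 0.0],     # 20 dB and above
]


def snr_normalize(x, snr_db):
    """Subtract the precomputed correction vector selected by the frame's SNR."""
    y = CORRECTIONS[bisect.bisect_right(SNR_BREAKPOINTS_DB, snr_db)]
    return [xi - yi for xi, yi in zip(x, y)]


result = snr_normalize([5.0, 1.0], snr_db=15.0)
```

The table is fixed once it has been computed, so the corrections do not follow changes in the environment.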

None of the prior attempts to solve the problem of changes in the acoustical environment during and between training and recognition have been very successful. Mean normalization allows the input speech feature vectors to be dynamically adjusted but is not very accurate because it only computes a single mean for all of the feature vectors extracted from the entire speech. SNR-dependent normalization is more accurate than mean normalization because it computes varying correction vectors depending on the SNR of the input speech, but it does not dynamically update the values of the correction vectors. Therefore, a solution is needed that both is accurate and dynamically updates the values used to normalize the input speech feature vectors.






EP 0 694 906 A1 



Summary of the Invention 

One aspect of the present invention provides a method and system for improving speech recognition through front-end normalization of feature vectors. In a speech recognition system of the present invention, speech to be recognized is spoken into a microphone, amplified by an amplifier, and converted from an analog signal to a digital signal by an analog-to-digital ("A/D") converter. The digital signal from the A/D converter is input to a feature extractor that breaks down the signal into frames of speech and then extracts a feature vector from each of the frames. The feature vector is input to an input normalizer that normalizes the vector. The normalized feature vector is input to a pattern matcher that compares the normalized vector to feature models stored in a database to find an exact match or a best match.

The input normalizer of the present invention normalizes the feature vector by computing a correction vector and subtracting the correction vector from the feature vector. The correction vector is computed based on the probability of the current frame of speech being noise and based on the average noise and speech feature vectors for a current utterance and the database of utterances. The normalization of feature vectors reduces the effect of changes in the acoustical environment on the feature vectors. By reducing the effect of changes in the acoustical environment on the feature vectors, the input normalizer of the present invention improves the accuracy of the speech recognition system.

Brief Description of the Drawings 

Figure 1 is a block diagram illustrating a speech recognition system incorporating the principles of the present invention;

Figure 2 is a high level flow chart illustrating the steps performed by an input normalizer of the system of Figure 1; and

Figures 3A and 3B collectively are a high level flow chart illustrating the steps performed in the normalization of feature vectors in the system of Figure 1.

Detailed Description of the Preferred Embodiment

The preferred embodiment of the present invention provides a method and system for improving speech recognition through front-end normalization of feature vectors. The normalization of feature vectors reduces the effect of changes in the acoustical environment on the feature vectors. Such changes could result, for example, from changes in the microphone used, the background noise, the distance between the speaker's mouth and the microphone, and the room acoustics. Without normalization, the effect of changes in the acoustical environment on the feature vectors could cause the same speech to be recognized as different speech. This could occur because the acoustical environment affects the feature vectors extracted from speech. Thus, different feature vectors may be extracted from the same speech if spoken in different acoustical environments. By reducing the effect of changes in the acoustical environment on the feature vectors, the input normalizer of the present invention improves the accuracy of the speech recognition system.

Figure 1 illustrates a speech recognition system 10 incorporating the principles of the present invention. In this system, speech to be recognized is spoken into a microphone 12, amplified by an amplifier 14, and converted from an analog signal to a digital signal by an analog-to-digital ("A/D") converter 16. The microphone 12, amplifier 14, and A/D converter 16 are conventional components and are well-known in the art. The digital signal from the A/D converter 16 is input to a computer system 18. More specifically, the digital signal is input to a feature extractor 20 that extracts certain features from the signal in the form of feature vectors. Speech is composed of utterances. An utterance is the spoken realization of a sentence and typically represents 1 to 10 seconds of speech. Each utterance is broken down into evenly-spaced time intervals called frames. A frame typically represents 10 milliseconds of speech. A feature vector is extracted from each frame of speech. That is, the feature extractor 20 breaks down the digital signal from the A/D converter 16 into frames of speech and then extracts a feature vector from each of the frames. In the preferred embodiment of the present invention, the feature vector extracted from each frame of speech comprises cepstral vectors. Cepstral vectors, and the methods used to extract cepstral vectors from speech, are well-known in the art.
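The framing step can be pictured with the short Python sketch below. It assumes non-overlapping 10 millisecond frames and a 16 kHz sampling rate; the function name `split_into_frames`, the sampling rate, and the decision to drop a trailing partial frame are choices made for this illustration and are not specified in the text.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Cut a digitized signal into evenly spaced, non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    # Any trailing partial frame is discarded in this sketch.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


# Slightly more than one second of dummy samples at an assumed 16 kHz rate.
frames = split_into_frames(list(range(16050)), 16000)
```

A feature extractor would then compute one feature vector per element of `frames`.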

The feature vector is then input to an input normalizer 22 that normalizes the vector. The normalization of the feature vector reduces the effect of changes in the acoustical environment on the feature vector. The normalized feature vector is then input to a pattern matcher 24 that compares the normalized vector to feature models stored in a database 26 to find an exact match or a best match. The feature models stored in the database 26 were generated from known speech. If there is an acceptable match, the known speech corresponding to the matching feature model is output. Otherwise, a message indicating that the speech could not be recognized is output. Typical pattern matchers are based on networks trained by statistical methods, such as hidden Markov models or neural networks. However, other pattern matchers may be used. Such pattern matchers are well-known in the art.

The steps performed by the input normalizer 22 are shown in Figure 2. The input normalizer 22 receives the feature vector $x_j$ for the current frame $j$, where $j$ is an index (step 210). In the preferred embodiment of the present invention, the feature vector comprises cepstral vectors. A cepstral vector is a set of coefficients derived from the energy in different frequency bands by taking the Discrete Cosine Transform ("DCT") of the logarithm of such energies. In the preferred embodiment, the feature vector comprises a static cepstral vector augmented with its first and second order derivatives with time, the delta cepstral vector and the delta-delta cepstral vector, respectively. Each cepstral vector comprises a set of thirteen cepstral coefficients. However, one of ordinary skill in the art will appreciate that cepstral vectors having a different number of cepstral coefficients may be used. Additionally, one of ordinary skill in the art will appreciate that other forms of feature vectors may be used.
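A minimal sketch of how such a feature vector could be assembled is given below, assuming a type-II DCT of the log band energies and a simple central difference for the time derivatives. The number of bands, the derivative formula, and the names `cepstrum` and `delta` are assumptions for illustration; the text only states that the DCT of the log energies is taken and that first and second order derivatives are appended.

```python
import math


def cepstrum(band_energies, num_coeffs=13):
    """DCT (type II, unnormalized) of the logarithm of the band energies."""
    log_e = [math.log(e) for e in band_energies]
    m = len(log_e)
    return [sum(le * math.cos(math.pi * k * (i + 0.5) / m)
                for i, le in enumerate(log_e))
            for k in range(num_coeffs)]


def delta(vectors):
    """Central-difference time derivative, clamping at the utterance edges."""
    last = len(vectors) - 1
    dim = len(vectors[0])
    return [[(vectors[min(t + 1, last)][i] - vectors[max(t - 1, 0)][i]) / 2.0
             for i in range(dim)]
            for t in range(len(vectors))]


# Twenty hypothetical band energies per frame, three frames.
static = [cepstrum([1.0 + b + t for b in range(20)]) for t in range(3)]
d1 = delta(static)      # delta cepstral vectors
d2 = delta(d1)          # delta-delta cepstral vectors
features = [s + a + b for s, a, b in zip(static, d1, d2)]  # 39 values per frame
```

With thirteen static coefficients, each assembled feature vector has 39 components, of which only the first thirteen would be normalized in the preferred embodiment.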

Next, the input normalizer 22 computes a correction vector $r(x_j)$ or $r_j$ using the function (step 212):

$$r(x_j) = p_j\,(n_{j-1} - n_{avg}) + (1 - p_j)\,(s_{j-1} - s_{avg})$$ (Eq. 1)

where $p_j$ is the a posteriori probability of the current frame $j$ being noise, $n_{j-1}$ and $s_{j-1}$ are the average noise and speech feature vectors for the current utterance, and $n_{avg}$ and $s_{avg}$ are the average noise and speech feature vectors for the database of utterances 26. The computation of $n$, $s$, $n_{avg}$, and $s_{avg}$ will be discussed below. Lastly, the input normalizer 22 computes a normalized feature vector $\hat{x}_j$ using the function (step 214):

$$\hat{x}_j = x_j - r(x_j)$$ (Eq. 2)
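Equations 1 and 2 translate almost directly into code. The sketch below uses plain Python lists; the function names `correction_vector` and `normalize` and the numeric values are illustrative assumptions.

```python
def correction_vector(p_noise, n_prev, s_prev, n_avg, s_avg):
    """Eq. 1: blend the noise and speech offsets by the noise probability."""
    return [p_noise * (n - na) + (1.0 - p_noise) * (s - sa)
            for n, s, na, sa in zip(n_prev, s_prev, n_avg, s_avg)]


def normalize(x, r):
    """Eq. 2: subtract the correction vector from the feature vector."""
    return [xi - ri for xi, ri in zip(x, r)]


# One-dimensional toy case: the utterance's noise mean sits 2.0 above the
# database noise mean and its speech mean sits 4.0 above the database one.
r = correction_vector(0.5, n_prev=[3.0], s_prev=[10.0], n_avg=[1.0], s_avg=[6.0])
x_hat = normalize([5.0], r)
```

When the frame is certainly noise ($p_j = 1$) only the noise offset is removed, and when it is certainly speech ($p_j = 0$) only the speech offset is removed.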



While the feature vector comprises the three cepstral vectors discussed above, in the preferred embodiment of the present invention, only the static cepstral vector is normalized; the delta cepstral vector and the delta-delta cepstral vector are not normalized.

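The computation of the correction vector $r(x_j)$ is simplified based on certain assumptions and estimations. First, assume that noise and speech follow a Gaussian distribution. Based on this assumption, the a posteriori probability of the current frame $j$ being noise $p_j$ is computed using the function:

$$p_j = \frac{\xi_{j-1}\,N(x_j, n_{j-1}, \Sigma_{n(j-1)})}{\xi_{j-1}\,N(x_j, n_{j-1}, \Sigma_{n(j-1)}) + (1 - \xi_{j-1})\,N(x_j, s_{j-1}, \Sigma_{s(j-1)})}$$ (Eq. 3)

where $\xi_{j-1}$ is the a priori probability of the current frame $j$ being noise, $N(x_j, n_{j-1}, \Sigma_{n(j-1)})$ and $N(x_j, s_{j-1}, \Sigma_{s(j-1)})$ are the Gaussian probability density functions ("pdfs") for noise and speech, respectively, and $\Sigma_{n(j-1)}$ and $\Sigma_{s(j-1)}$ are the covariance matrices for noise and speech, respectively. The Gaussian pdfs for noise and speech, $N(x_j, n_{j-1}, \Sigma_{n(j-1)})$ and $N(x_j, s_{j-1}, \Sigma_{s(j-1)})$, are represented using the standard function for Gaussian pdfs:

$$N(x_j, n_{j-1}, \Sigma_{n(j-1)}) = \frac{1}{(2\pi)^{q/2}\,|\Sigma_{n(j-1)}|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x_j - n_{j-1})^T\,\Sigma_{n(j-1)}^{-1}\,(x_j - n_{j-1})\right)$$ (Eq. 4a)

and

$$N(x_j, s_{j-1}, \Sigma_{s(j-1)}) = \frac{1}{(2\pi)^{q/2}\,|\Sigma_{s(j-1)}|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x_j - s_{j-1})^T\,\Sigma_{s(j-1)}^{-1}\,(x_j - s_{j-1})\right)$$ (Eq. 4b)

where $q$ is the dimension of $x_j$, exp is the exponential function, and $T$ represents the transpose function.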
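For intuition, the posterior of Equation 3 can be evaluated in the one-dimensional case ($q = 1$), where each covariance matrix reduces to a single variance. The sketch below makes that simplifying assumption; the names `gaussian_pdf` and `noise_posterior` and the numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def noise_posterior(x, xi, n_mean, n_var, s_mean, s_var):
    """Posterior probability that x is noise, given the prior xi (Eq. 3, q = 1)."""
    noise = xi * gaussian_pdf(x, n_mean, n_var)
    speech = (1.0 - xi) * gaussian_pdf(x, s_mean, s_var)
    return noise / (noise + speech)


# Noise centred at 0, speech centred at 2, equal variances, prior 0.5.
p_mid = noise_posterior(1.0, 0.5, 0.0, 1.0, 2.0, 1.0)    # halfway between the means
p_near_noise = noise_posterior(0.0, 0.5, 0.0, 1.0, 2.0, 1.0)
```

A value halfway between the two means is equally likely to be noise or speech, and the posterior rises toward one as the value approaches the noise mean.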
Then, the a posteriori probability of the current frame $j$ being noise $p_j$ is represented by the sigmoid function:

$$p_j = \frac{1}{1 + \exp\big(d(x_j)\big)}$$ (Eq. 5)

where

$$d(x_j) = \ln\!\left[\frac{(1 - \xi_{j-1})\,N(x_j, s_{j-1}, \Sigma_{s(j-1)})}{\xi_{j-1}\,N(x_j, n_{j-1}, \Sigma_{n(j-1)})}\right]$$ (Eq. 6)

where $d(x_j)$ or $d_j$ is the distortion. The distortion is an indication of whether a signal is noise or speech. If the distortion is largely negative, the signal is noise; if the distortion is largely positive, the signal is speech; if the distortion is zero, the signal may be noise or speech.
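The mapping from distortion to noise probability can be sketched as below. The branch on the sign of `d` is an implementation detail added here to avoid floating-point overflow for large distortions; it is not described in the text, and the function name `noise_probability` is an assumption.

```python
import math


def noise_probability(d):
    """p = 1 / (1 + exp(d)), evaluated without overflow for large |d|."""
    if d >= 0:
        e = math.exp(-d)          # safe: -d <= 0
        return e / (1.0 + e)      # algebraically equal to 1 / (1 + exp(d))
    return 1.0 / (1.0 + math.exp(d))


p_noise_like = noise_probability(-1000.0)   # largely negative distortion
p_speech_like = noise_probability(1000.0)   # largely positive distortion
p_undecided = noise_probability(0.0)        # zero distortion
```

The three sample values mirror the three cases in the text: noise, speech, and undetermined.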

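Second, assume that the components of $x_j$ are independent of one another. Based on this assumption, $\Sigma_n$ and $\Sigma_s$ are modelled using diagonal covariance matrices $\sigma_n^2$ and $\sigma_s^2$, respectively. Thus, $d(x_j)$ is represented using the function:

$$d(x_j) = \sum_{l=0}^{q-1}\left[\frac{(x_j[l] - n_{j-1}[l])^2}{2\,\sigma^2_{n(j-1)}[l]} - \frac{(x_j[l] - s_{j-1}[l])^2}{2\,\sigma^2_{s(j-1)}[l]} + \ln\frac{\sigma_{n(j-1)}[l]}{\sigma_{s(j-1)}[l]}\right] + \ln\frac{1 - \xi_{j-1}}{\xi_{j-1}}$$ (Eq. 7)

where $q$ is the dimension of $\sigma_n^2$ and $\sigma_s^2$. Further, the most important factor in discriminating noise from speech is the power term ($l = 0$). Thus, $d(x_j)$ is approximated using the function:

$$d(x_j) \approx d_0(x_j) = \frac{(x_j[0] - n_{j-1}[0])^2}{2\,\sigma^2_{n(j-1)}[0]} - \frac{(x_j[0] - s_{j-1}[0])^2}{2\,\sigma^2_{s(j-1)}[0]} + \ln\!\left[\frac{\sigma_{n(j-1)}[0]\,(1 - \xi_{j-1})}{\sigma_{s(j-1)}[0]\,\xi_{j-1}}\right]$$ (Eq. 8)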
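The diagonal-covariance distortion and its power-term approximation can be compared with a small Python sketch. It assumes that the per-component variances are passed in as lists and that the log-likelihood ratio of two diagonal Gaussians is used, which is what Equations 3 through 6 imply; the function name `distortion` and the `dims` argument are inventions of this example.

```python
import math


def distortion(x, n, s, var_n, var_s, xi, dims=None):
    """Log ratio of (speech prior * speech pdf) to (noise prior * noise pdf).

    dims selects which components enter the sum; dims=[0] gives the
    power-term approximation, None uses every component.
    """
    if dims is None:
        dims = range(len(x))
    d = math.log((1.0 - xi) / xi)
    for l in dims:
        d += (x[l] - n[l]) ** 2 / (2.0 * var_n[l])
        d -= (x[l] - s[l]) ** 2 / (2.0 * var_s[l])
        d += 0.5 * math.log(var_n[l] / var_s[l])   # equals ln(sigma_n / sigma_s)
    return d


# Hypothetical two-component models with unit variances and a 0.5 prior.
n, s = [0.0, 0.0], [2.0, 0.0]
ones = [1.0, 1.0]
d_full = distortion([0.0, 0.3], n, s, ones, ones, 0.5)
d_power = distortion([0.0, 0.3], n, s, ones, ones, 0.5, dims=[0])
```

In this toy case the second component has identical noise and speech models, so it contributes nothing and the power-term approximation matches the full sum.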



Next, the values of $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, and $\xi$ are estimated using a modified version of the well-known Estimate-Maximize ("EM") algorithm. The EM algorithm is discussed in N.M. Laird, A.P. Dempster, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Annals Royal Statistical Society, 1-38, December 1967. The EM algorithm generates maximum likelihood estimates of the values by refining previous estimates based on new values. This algorithm uses a window function over which the estimates are refined. The window function defines the interval of time over which past estimates are used to refine the current estimates. The standard EM algorithm uses a rectangular window function. A rectangular window function gives equal weight to the data over the entire window. The modified version of the EM algorithm used in the preferred embodiment of the present invention uses an exponential window function. An exponential window function gives more weight to recent data in the window. Thus, the values of $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, and $\xi$ are estimated using the functions:



$$n_j = \frac{\sum_{k=0}^{\infty} w_k\,p_{j-k}\,x_{j-k}}{\sum_{k=0}^{\infty} w_k\,p_{j-k}}$$ (Eq. 9)

$$s_j = \frac{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})\,x_{j-k}}{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})}$$ (Eq. 10)

$$\sigma^2_{n(j)} = \frac{\sum_{k=0}^{\infty} w_k\,p_{j-k}\,x^2_{j-k}}{\sum_{k=0}^{\infty} w_k\,p_{j-k}} - n_j^2$$ (Eq. 11)

$$\sigma^2_{s(j)} = \frac{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})\,x^2_{j-k}}{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})} - s_j^2$$ (Eq. 12)

$$\xi_j = \frac{\sum_{k=0}^{\infty} w_k\,p_{j-k}}{\sum_{k=0}^{\infty} w_k}$$ (Eq. 13)

where $w_k$ is the exponential window function.

The exponential window function $w_k$ is represented by:

$$w_k = \alpha^k$$ (Eq. 14)

where $\alpha$ is a parameter that controls the rate of adaptation. The rate of adaptation determines how much weight is given to past data relative to the current data. The smaller $\alpha$ is, the less weight that is given to past data relative to the current data; the larger $\alpha$ is, the more weight that is given to past data relative to the current data. The value of $\alpha$ is computed using the function:

$$\alpha = \left(\tfrac{1}{2}\right)^{1/(T F_s)}$$ (Eq. 15)

where $T$ is a time constant and $F_s$ is the sampling frequency of the A/D converter 16. In the preferred embodiment of the present invention, separate $\alpha$'s are used for noise and for speech. The use of separate $\alpha$'s allows noise and speech to be adapted at different rates. In the preferred embodiment in which separate $\alpha$'s are used, a smaller $\alpha$ is used for noise than for speech. Thus, the functions used to estimate the values of $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, and $\xi$ are reduced to:



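$$n_j = \frac{a_{n(j)}}{c_{n(j)}}$$ (Eq. 16)

$$s_j = \frac{a_{s(j)}}{c_{s(j)}}$$ (Eq. 17)

$$\sigma^2_{n(j)} = \frac{b_{n(j)}}{c_{n(j)}} - n_j^2$$ (Eq. 18)

$$\sigma^2_{s(j)} = \frac{b_{s(j)}}{c_{s(j)}} - s_j^2$$ (Eq. 19)

$$\xi_j = (1 - \alpha_n)\,c_{n(j)}$$ (Eq. 20)

where

$$a_{n(j)} = p_j\,x_j + \alpha_n\,a_{n(j-1)}$$ (Eq. 21)

$$b_{n(j)} = p_j\,x_j^2 + \alpha_n\,b_{n(j-1)}$$ (Eq. 22)

$$c_{n(j)} = p_j + \alpha_n\,c_{n(j-1)}$$ (Eq. 23)

$$a_{s(j)} = (1 - p_j)\,x_j + \alpha_s\,a_{s(j-1)}$$ (Eq. 24)

$$b_{s(j)} = (1 - p_j)\,x_j^2 + \alpha_s\,b_{s(j-1)}$$ (Eq. 25)

$$c_{s(j)} = (1 - p_j) + \alpha_s\,c_{s(j-1)}$$ (Eq. 26)

where $\alpha_n$ and $\alpha_s$ are the parameters that control the rate of adaptation for noise and speech, respectively. The computation of initial values for $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, $\xi$, $a_n$, $b_n$, $c_n$, $a_s$, $b_s$, and $c_s$ will be discussed below.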
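The reduction from exponentially windowed sums to recursive accumulators can be checked numerically. The sketch below assumes a finite history with accumulators starting at zero (the text instead seeds them from the database of utterances) and uses the hypothetical names `adaptation_rate`, `windowed_mean`, and `recursive_mean`.

```python
def adaptation_rate(time_constant_s, rate_hz):
    """alpha = (1/2) ** (1 / (T * F)); a weight halves after T * F updates."""
    return 0.5 ** (1.0 / (time_constant_s * rate_hz))


def windowed_mean(weights, values, alpha):
    """Direct form: sum_k alpha^k w[j-k] x[j-k] / sum_k alpha^k w[j-k]."""
    j = len(values) - 1
    num = sum(alpha ** k * weights[j - k] * values[j - k] for k in range(j + 1))
    den = sum(alpha ** k * weights[j - k] for k in range(j + 1))
    return num / den


def recursive_mean(weights, values, alpha):
    """Recursive form: a_j = w_j x_j + alpha a_{j-1}, c_j = w_j + alpha c_{j-1}."""
    a = c = 0.0
    for w, x in zip(weights, values):
        a = w * x + alpha * a
        c = w + alpha * c
    return a / c


# Per-frame noise probabilities act as the weights for the noise mean.
p = [0.9, 0.2, 0.7]
x = [1.0, 5.0, 2.0]
direct = windowed_mean(p, x, alpha=0.5)
recursive = recursive_mean(p, x, alpha=0.5)
```

The recursive form needs only the previous accumulators and the current frame, which is why the per-frame update cost stays constant regardless of how long the utterance is.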

The steps performed in the normalization of a feature vector are shown in Figures 3A and 3B. First, values for $\alpha_n$ and $\alpha_s$ are selected (step 310). The values for $\alpha_n$ and $\alpha_s$ are selected based on the desired rate of adaptation (as discussed above). Additionally, the value of $j$ is set equal to zero (step 312) and initial values for $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, and $\xi$ are estimated (step 314). The initial values for $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, and $\xi$ are estimated from the database of utterances 26 using standard EM techniques.

$$n_0 = n_{avg}$$ (Eq. 27)

$$s_0 = s_{avg}$$ (Eq. 28)

$$\sigma^2_{n(0)} = \sigma^2_{n(avg)}$$ (Eq. 29)

$$\sigma^2_{s(0)} = \sigma^2_{s(avg)}$$ (Eq. 30)

$$\xi_0 = \xi_{avg}$$ (Eq. 31)

[Eqs. 32 through 37, which give the initial values of $a_{n(0)}$, $b_{n(0)}$, $c_{n(0)}$, $a_{s(0)}$, $b_{s(0)}$, and $c_{s(0)}$, are not legible in the source.]


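Then, the feature vector $x_j$ for the current frame $j$ is received (step 316). The distortion $d_j$ is computed using the function (step 318):

$$d_j = \frac{(x_j[0] - n_{j-1}[0])^2}{2\,\sigma^2_{n(j-1)}[0]} - \frac{(x_j[0] - s_{j-1}[0])^2}{2\,\sigma^2_{s(j-1)}[0]} + \ln\!\left[\frac{\sigma_{n(j-1)}[0]\,(1 - \xi_{j-1})}{\sigma_{s(j-1)}[0]\,\xi_{j-1}}\right]$$ (Eq. 38)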
The a posteriori probability of the current frame $j$ being noise $p_j$ is computed using the function (step 320):

$$p_j = \frac{1}{1 + \exp(d_j)}$$ (Eq. 39)

The correction vector $r_j$ is computed using the function (step 322):

$$r_j[l] = p_j\,(n_{j-1}[l] - n_{avg}[l]) + (1 - p_j)\,(s_{j-1}[l] - s_{avg}[l])$$ (Eq. 40)

for $l = 0, 1, \ldots, m$.

The normalized feature vector $\hat{x}_j$ is computed using the function (step 324):

$$\hat{x}_j[l] = x_j[l] - r_j[l]$$ (Eq. 41)

for $l = 0, 1, \ldots, m$.

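The values of $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, and $\xi$ are updated using the functions (step 326):

$$n_j[l] = \frac{a_{n(j)}[l]}{c_{n(j)}}$$ (Eq. 42)

$$s_j[l] = \frac{a_{s(j)}[l]}{c_{s(j)}}$$ (Eq. 43)

$$\sigma^2_{n(j)}[l] = \frac{b_{n(j)}[l]}{c_{n(j)}} - n_j[l]^2$$ (Eq. 44)

$$\sigma^2_{s(j)}[l] = \frac{b_{s(j)}[l]}{c_{s(j)}} - s_j[l]^2$$ (Eq. 45)

for $l = 0, 1, \ldots, m$, and

$$\xi_j = (1 - \alpha_n)\,c_{n(j)}$$ (Eq. 46)

where

$$a_{n(j)}[l] = p_j\,x_j[l] + \alpha_n\,a_{n(j-1)}[l]$$ (Eq. 47)

$$b_{n(j)}[l] = p_j\,x_j[l]^2 + \alpha_n\,b_{n(j-1)}[l]$$ (Eq. 48)

$$c_{n(j)} = p_j + \alpha_n\,c_{n(j-1)}$$ (Eq. 49)

$$a_{s(j)}[l] = (1 - p_j)\,x_j[l] + \alpha_s\,a_{s(j-1)}[l]$$ (Eq. 50)

$$b_{s(j)}[l] = (1 - p_j)\,x_j[l]^2 + \alpha_s\,b_{s(j-1)}[l]$$ (Eq. 51)

$$c_{s(j)} = (1 - p_j) + \alpha_s\,c_{s(j-1)}$$ (Eq. 52)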

Next, the input normalizer 22 determines whether frame $j$ is the last frame in the current utterance (step 328). If frame $j$ is not the last frame in the current utterance, $j$ is incremented (step 330) and steps 316 through 326 are repeated for the next frame. If frame $j$ is the last frame in the current utterance, the input normalizer 22 determines whether the current utterance is the last utterance (step 332). If the current utterance is not the last utterance, $j$ is reset to zero (step 334), the values of $n$, $s$, $\sigma_n^2$, $\sigma_s^2$, and $\xi$ are reset to the estimated initial values (step 336), and steps 316 through 326 are repeated for each frame in the next utterance. If the current utterance is the last utterance, the input normalizer 22 returns.
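The per-frame loop of steps 316 through 326 might be organized as in the Python sketch below. Several details are assumptions made so that the example runs: the class name `InputNormalizer`, the accumulator seeding in `reset` (chosen so that the ratio formulas reproduce the initial estimates), the clamp on the distortion, the variance floor, and the clamp that keeps the prior strictly between 0 and 1. None of these safeguards or seed values are taken from the text.

```python
import math


class InputNormalizer:
    """Illustrative per-frame normalizer; not a definitive implementation."""

    def __init__(self, n_avg, s_avg, var_n_avg, var_s_avg, xi_avg,
                 alpha_n, alpha_s):
        self.n_avg, self.s_avg = list(n_avg), list(s_avg)
        self.var_n_avg, self.var_s_avg = list(var_n_avg), list(var_s_avg)
        self.xi_avg = xi_avg
        self.alpha_n, self.alpha_s = alpha_n, alpha_s
        self.reset()

    def reset(self):
        """Restore the initial estimates at the start of an utterance."""
        self.n, self.s = list(self.n_avg), list(self.s_avg)
        self.var_n, self.var_s = list(self.var_n_avg), list(self.var_s_avg)
        self.xi = self.xi_avg
        # Assumed seeding: a/c, b/c - mean^2 and (1 - alpha_n) * c_n
        # give back the initial means, variances and prior.
        self.c_n = self.xi / (1.0 - self.alpha_n)
        self.c_s = (1.0 - self.xi) / (1.0 - self.alpha_s)
        self.a_n = [m * self.c_n for m in self.n]
        self.a_s = [m * self.c_s for m in self.s]
        self.b_n = [(v + m * m) * self.c_n for v, m in zip(self.var_n, self.n)]
        self.b_s = [(v + m * m) * self.c_s for v, m in zip(self.var_s, self.s)]

    def process(self, x):
        """Normalize one feature vector, then update the running estimates."""
        # Distortion from the power term (component 0) only.
        d = ((x[0] - self.n[0]) ** 2 / (2.0 * self.var_n[0])
             - (x[0] - self.s[0]) ** 2 / (2.0 * self.var_s[0])
             + 0.5 * math.log(self.var_n[0] / self.var_s[0])
             + math.log((1.0 - self.xi) / self.xi))
        d = max(-50.0, min(50.0, d))        # added guard against exp overflow
        p = 1.0 / (1.0 + math.exp(d))       # probability the frame is noise
        # Correction and normalization use the estimates from earlier frames.
        x_hat = [xl - (p * (n - na) + (1.0 - p) * (s - sa))
                 for xl, n, na, s, sa
                 in zip(x, self.n, self.n_avg, self.s, self.s_avg)]
        # Recursive accumulator updates followed by the new estimates.
        self.c_n = p + self.alpha_n * self.c_n
        self.c_s = (1.0 - p) + self.alpha_s * self.c_s
        for l, xl in enumerate(x):
            self.a_n[l] = p * xl + self.alpha_n * self.a_n[l]
            self.b_n[l] = p * xl * xl + self.alpha_n * self.b_n[l]
            self.a_s[l] = (1.0 - p) * xl + self.alpha_s * self.a_s[l]
            self.b_s[l] = (1.0 - p) * xl * xl + self.alpha_s * self.b_s[l]
            self.n[l] = self.a_n[l] / self.c_n
            self.s[l] = self.a_s[l] / self.c_s
            # The variance floor is an added safeguard.
            self.var_n[l] = max(self.b_n[l] / self.c_n - self.n[l] ** 2, 1e-6)
            self.var_s[l] = max(self.b_s[l] / self.c_s - self.s[l] ** 2, 1e-6)
        self.xi = min(max((1.0 - self.alpha_n) * self.c_n, 1e-6), 1.0 - 1e-6)
        return x_hat


# Hypothetical database statistics: noise near 0, speech near 10 in the power term.
nz = InputNormalizer([0.0, 0.0], [10.0, 1.0], [1.0, 1.0], [1.0, 1.0],
                     xi_avg=0.5, alpha_n=0.9, alpha_s=0.9)
first = nz.process([0.0, 0.3])   # estimates equal the database averages here
for _ in range(200):             # noise whose power is shifted by +2.0
    out = nz.process([2.0, 0.0])
```

On the first frame the utterance estimates equal the database averages, so the correction vector is zero. After many shifted noise frames the noise mean tracks the shift and the correction removes it. Calling `reset()` between utterances corresponds to steps 334 and 336.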
In order to reduce the computational complexity of the input normalizer 22 of the present invention, one of ordinary skill in the art will appreciate that several modifications could be made to the input normalizer. First, the last term could be eliminated from the function (Eq. 38) used to compute the distortion $d_j$. This term does not significantly affect the value of the distortion $d_j$, but is expensive to compute because it involves a logarithm. Additionally, the a posteriori probability of the current frame $j$ being noise $p_j$ could be computed using a look-up table. This table would contain the possible values for the distortion $d_j$ and the corresponding values for the a posteriori probability $p_j$. Lastly, the values of $n$, $s$, $\sigma_n^2$, and $\sigma_s^2$ could be updated every few frames, instead of every frame, and the value of $\xi$ could be kept at its initial value and not updated at all. Each of these modifications will improve the efficiency of the input normalizer 22 without significantly affecting the accuracy of the input normalizer.
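The look-up table modification could be sketched as follows. The table range, the step size, and the names `D_MIN`, `D_MAX`, `STEP`, `TABLE`, and `noise_probability_lut` are assumptions for this example; the text only says that the table pairs distortion values with probabilities.

```python
import math

# Hypothetical quantization of the distortion axis.
D_MIN, D_MAX, STEP = -8.0, 8.0, 0.25

# Precompute p = 1 / (1 + exp(d)) once for every quantized distortion value.
TABLE = [1.0 / (1.0 + math.exp(D_MIN + i * STEP))
         for i in range(int((D_MAX - D_MIN) / STEP) + 1)]


def noise_probability_lut(d):
    """Return the tabulated probability for the nearest quantized distortion."""
    d = min(max(d, D_MIN), D_MAX)              # clamp to the table range
    return TABLE[int(round((d - D_MIN) / STEP))]


p_zero = noise_probability_lut(0.0)
p_speech = noise_probability_lut(100.0)
```

A finer step reduces the quantization error at the cost of a larger table; with a step of 0.25 the error stays within a few hundredths.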

While the normalization of feature vectors has been described only during recognition, the preferred embodiment of the present invention involves the normalization of feature vectors during training as well. Specifically, each utterance in the database 26 is normalized according to the principles of the present invention and then the system is retrained using the database of normalized utterances. The database of normalized utterances is then used during recognition as described above.

One of ordinary skill in the art will now appreciate that the present invention provides a method and system for improving speech recognition through front-end normalization of feature vectors. Although the present invention has been shown and described with reference to a preferred embodiment, equivalent alterations and modifications will occur to those skilled in the art upon reading and understanding this specification. The present invention includes all such equivalent alterations and modifications and is limited only by the scope of the following claims.

Claims 



1. A method for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the method comprising the steps of:
providing a database of known utterances, the database of utterances having an average noise feature vector and an average speech feature vector;
receiving a feature vector representing a frame of speech in an utterance to be recognized, the frame of speech having a probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
computing a correction vector based on the probability of the frame of speech being noise and based on the average noise and speech feature vectors for the utterance and the database of utterances; and
computing a normalized feature vector based on the feature vector and the correction vector.

2. The method of claim 1, wherein the step of receiving a feature vector comprises the step of receiving a cepstral vector.

3. The method of claim 1, wherein the probability of the frame of speech being noise and the average noise and speech feature vectors for the utterance are updated for each frame of speech.

4. The method of claim 1, wherein the step of computing a correction vector includes the steps of:
computing the probability of the frame of speech being noise based on a distortion measure of the frame of speech;
computing the average noise and speech feature vectors for the utterance;
computing the average noise and speech feature vectors for the database of utterances; and
computing the correction vector based on the probability of the frame of speech being noise and the differences between the average noise and speech feature vectors for the utterance and the database of utterances.

5. A method for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the method comprising the steps of:
providing a database of known utterances, the database of utterances having an average noise feature vector and an average speech feature vector;
receiving a feature vector $x_j$ representing a frame of speech $j$ in an utterance to be recognized, the frame of speech having an a posteriori probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
computing a correction vector $r(x_j)$ as:

$$r(x_j) = p_j\,(n_{j-1} - n_{avg}) + (1 - p_j)\,(s_{j-1} - s_{avg})$$

wherein $p_j$ is the a posteriori probability of the frame of speech $j$ being noise, $n_{j-1}$ and $s_{j-1}$ are the average noise and speech feature vectors for the utterance, and $n_{avg}$ and $s_{avg}$ are the average noise and speech feature vectors for the database of utterances; and
computing a normalized feature vector $\hat{x}_j$ as:

$$\hat{x}_j = x_j - r(x_j).$$



6. The method of claim 5, wherein the step of receiving a feature vector includes the step of receiving a cepstral vector.

7. The method of claim 5, wherein the a posteriori probability of the frame of speech being noise and the average noise and speech feature vectors for the utterance are updated for each frame of speech.

8. The method of claim 5, wherein the a posteriori probability of the frame of speech $j$ being noise $p_j$ is computed as:

$$p_j = \frac{\xi_{j-1}\,N(x_j, n_{j-1}, \Sigma_{n(j-1)})}{\xi_{j-1}\,N(x_j, n_{j-1}, \Sigma_{n(j-1)}) + (1 - \xi_{j-1})\,N(x_j, s_{j-1}, \Sigma_{s(j-1)})}$$

wherein $\xi_{j-1}$ is an a priori probability of the frame of speech $j$ being noise, $N(x_j, n_{j-1}, \Sigma_{n(j-1)})$ and $N(x_j, s_{j-1}, \Sigma_{s(j-1)})$ are Gaussian probability density functions for noise and speech, respectively, and $\Sigma_{n(j-1)}$ and $\Sigma_{s(j-1)}$ are covariance matrices for noise and speech, respectively.

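9. The method of claim 8, wherein the Gaussian probability density functions for noise and speech, $N(x_j, n_{j-1}, \Sigma_{n(j-1)})$ and $N(x_j, s_{j-1}, \Sigma_{s(j-1)})$, are computed as:

$$N(x_j, n_{j-1}, \Sigma_{n(j-1)}) = \frac{1}{(2\pi)^{q/2}\,|\Sigma_{n(j-1)}|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x_j - n_{j-1})^T\,\Sigma_{n(j-1)}^{-1}\,(x_j - n_{j-1})\right)$$

and

$$N(x_j, s_{j-1}, \Sigma_{s(j-1)}) = \frac{1}{(2\pi)^{q/2}\,|\Sigma_{s(j-1)}|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x_j - s_{j-1})^T\,\Sigma_{s(j-1)}^{-1}\,(x_j - s_{j-1})\right)$$

wherein $q$ is a dimension of $x_j$, exp is an exponential function, and $T$ represents a transpose function.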
10. The method of claim 5, wherein the a posteriori probability of the frame of speech $j$ being noise $p_j$ is computed as:

$$p_j = \frac{1}{1 + \exp\big(d(x_j)\big)}$$

wherein $d(x_j)$ is a distortion measure of the frame of speech $j$.
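11. The method of claim 10, wherein the distortion measure $d(x_j)$ is computed as:

$$d(x_j) = \tfrac{1}{2}\,(x_j - n_{j-1})^T\,\Sigma_{n(j-1)}^{-1}\,(x_j - n_{j-1}) - \tfrac{1}{2}\,(x_j - s_{j-1})^T\,\Sigma_{s(j-1)}^{-1}\,(x_j - s_{j-1}) + \ln\!\left[\frac{|\Sigma_{n(j-1)}|^{1/2}\,(1 - \xi_{j-1})}{|\Sigma_{s(j-1)}|^{1/2}\,\xi_{j-1}}\right]$$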



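12. The method of claim 10, wherein the distortion measure $d(x_j)$ is computed as:

$$d(x_j) = \sum_{l=0}^{q-1}\left[\frac{(x_j[l] - n_{j-1}[l])^2}{2\,\sigma^2_{n(j-1)}[l]} - \frac{(x_j[l] - s_{j-1}[l])^2}{2\,\sigma^2_{s(j-1)}[l]} + \ln\frac{\sigma_{n(j-1)}[l]}{\sigma_{s(j-1)}[l]}\right] + \ln\frac{1 - \xi_{j-1}}{\xi_{j-1}}$$

wherein $q$ is a dimension of $\sigma_n^2$ and $\sigma_s^2$.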
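13. The method of claim 10, wherein the distortion measure $d(x_j)$ is computed as:

$$d(x_j) = \frac{(x_j[0] - n_{j-1}[0])^2}{2\,\sigma^2_{n(j-1)}[0]} - \frac{(x_j[0] - s_{j-1}[0])^2}{2\,\sigma^2_{s(j-1)}[0]} + \ln\!\left[\frac{\sigma_{n(j-1)}[0]\,(1 - \xi_{j-1})}{\sigma_{s(j-1)}[0]\,\xi_{j-1}}\right]$$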



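14. The method of claim 13, wherein the average noise and speech feature vectors for the utterance are computed as:

$$n_j = \frac{\sum_{k=0}^{\infty} w_k\,p_{j-k}\,x_{j-k}}{\sum_{k=0}^{\infty} w_k\,p_{j-k}} \quad\text{and}\quad s_j = \frac{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})\,x_{j-k}}{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})}$$

wherein $w_k$ is an exponential window function represented as:

$$w_k = \alpha^k$$

wherein $\alpha$ is a parameter that controls a rate of adaptation.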
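15. The method of claim 14, wherein the diagonal covariance matrices for noise and speech are computed as:

$$\sigma^2_{n(j)} = \frac{\sum_{k=0}^{\infty} w_k\,p_{j-k}\,x^2_{j-k}}{\sum_{k=0}^{\infty} w_k\,p_{j-k}} - n_j^2 \quad\text{and}\quad \sigma^2_{s(j)} = \frac{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})\,x^2_{j-k}}{\sum_{k=0}^{\infty} w_k\,(1 - p_{j-k})} - s_j^2$$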



16. The method of claim 15, wherein the a priori probability of the frame of speech $j$ being noise $\xi_j$ is computed as:

$$\xi_j = \frac{\sum_{k=0}^{\infty} w_k\,p_{j-k}}{\sum_{k=0}^{\infty} w_k}$$

17. The method of claim 13, wherein the average noise and speech feature vectors for the utterance are computed as:

$$n_j = \frac{a_{n(j)}}{c_{n(j)}} \quad\text{and}\quad s_j = \frac{a_{s(j)}}{c_{s(j)}}$$

wherein

$$a_{n(j)} = p_j\,x_j + \alpha_n\,a_{n(j-1)}, \qquad c_{n(j)} = p_j + \alpha_n\,c_{n(j-1)},$$

$$a_{s(j)} = (1 - p_j)\,x_j + \alpha_s\,a_{s(j-1)}, \quad\text{and}\quad c_{s(j)} = (1 - p_j) + \alpha_s\,c_{s(j-1)}$$

and wherein $\alpha_n$ and $\alpha_s$ are parameters that control rates of adaptation for noise and speech, respectively.
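18. The method of claim 17, wherein the diagonal covariance matrices for noise and speech are computed as:

$$\sigma^2_{n(j)} = \frac{b_{n(j)}}{c_{n(j)}} - n_j^2 \quad\text{and}\quad \sigma^2_{s(j)} = \frac{b_{s(j)}}{c_{s(j)}} - s_j^2$$

wherein

$$b_{n(j)} = p_j\,x_j^2 + \alpha_n\,b_{n(j-1)} \quad\text{and}\quad b_{s(j)} = (1 - p_j)\,x_j^2 + \alpha_s\,b_{s(j-1)}.$$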

19. The method of claim 18, wherein the a priori probability of the frame of speech $j$ being noise $\xi_j$ is computed as:

$$\xi_j = (1 - \alpha_n)\,c_{n(j)}.$$
20. A system for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the system comprising:
an input normalizer for:
receiving a feature vector representing a frame of speech in an utterance to be recognized, the frame of speech having a probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
computing a correction vector based on the probability of the frame of speech being noise and based on the average noise and speech feature vectors for the utterance and the database of utterances; and
computing a normalized feature vector based on the feature vector and the correction vector.

21. A system for improving speech recognition through front-end normalization of feature vectors, speech comprising utterances, each utterance comprising frames of speech, each frame of speech being represented by a feature vector, the system comprising:
a database of known utterances, the utterances being represented by feature models, the database of utterances having an average noise feature vector and an average speech feature vector;
a feature extractor for extracting a feature vector from a frame of speech in an utterance to be recognized, the frame of speech having a probability of being noise, the utterance having an average noise feature vector and an average speech feature vector;
an input normalizer for normalizing the feature vector by: (i) computing a correction vector based on the probability of the frame of speech being noise and based on the average noise and speech feature vectors for the utterance and the database of utterances, and (ii) computing a normalized feature vector based on the feature vector and the correction vector; and
a pattern matcher for comparing the normalized feature vector to the feature models in the database.


